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Abstract. Non-negative sparse coding is a method for decompos- 
ing multivariate data into non-negative sparse components. In this 
paper we briefly describe the motivation behind this type of data 
representation and its relation to standard sparse coding and non- 
negative matrix factorization. We then give a simple yet efficient 
multiplicative algorithm for finding the optimal values of the hid- 
den components. In addition, we show how the basis vectors can 
be learned from the observed data. Simulations demonstrate the 
effectiveness of the proposed method. 



INTRODUCTION 

Linear data representations are widely used in signal processing and data 
analysis. A traditional method of choice for signal representation is of course 
Fourier analysis, but also wavelet representations are increasingly being used 
in a variety of applications. Both of these methods have strong mathematical 
foundations and fast implementations, but they share the important draw- 
back that they are not adapted to the particular data being analyzed. 

Data-adaptive representations, on the other hand, are representations 
that are tailored to the statistics of the data. Such representations are learned 
directly from the observed data by optimizing some measure that quantifies 
the desired properties of the representation. This class of methods include 
principal component analysis (PCA), independent component analysis (1CA), 
sparse coding, and non- negative matrix factorization (NMF). Some of these 
methods have their roots in neural computation, but have since been shown 
to be widely applicable for signal analysis. 

In this paper we propose to combine sparse coding and non-negative ma- 
trix factorization into non-negative sparse coding (NNSC). Again, the moti- 
vation comes partly from modeling neural information processing. We believe 
that, as with previous methods, this technique will be found useful in a more 
general signal processing framework. 



NON-NEGATIVE SPARSE CODING 

Assume that we observe data in the form of a large number of i.i.d. random 
vectors x„, where n is the sample index. Arranging these into the columns 
of a matrix X, then linear decompositions describe this data as X w AS. 
The matrix A is called the mixing matrix, and contains as its columns the 
basis vectors (features) of the decomposition. The rows of S contain the 
corresponding hidden components that give the contribution of each basis 
vector in the input vectors. Although some decompositions provide an exact 
reconstruction of the data (i.e. X = AS) the ones that we shall consider here 
are approximative in nature. 

In linear sparse coding || , the goal is to find a decomposition in which 
the hidden components are sparse, meaning that they have probability den- 
sities which are highly peaked at zero and have heavy tails. This basically 
means that any given input vector can be well represented using only a few 
significantly non-zero hidden coefficients. Combining the goal of small re- 
construction error with that of sparseness, one can arrive at the following 
objective function to be minimized j^, ||: 



where the squared matrix norm is simply the summed squared value of the 
elements, i.e. ||X — AS|| 2 = J^V ;[Xy — (AS)^] 2 . The tradeoff between sparse- 
ness and accurate reconstruction is controlled by the parameter A, whereas 
the form of / defines how sparseness is measured. To achieve a sparse code, 
the form of / must be chosen correctly: A typical choice is f(s) = \s\, al- 
though often similar functions that exhibit smoother behaviour at zero are 
chosen for numerical stability. 

There is one important problem with this objective: As / typically is a 
strictly increasing function of the absolute value of its argument, the objective 
can always be decreased by simply scaling up A and correspondingly scaling 
down S. The consequences of this is that optimization of ([l]) with respect to 
both A and S leads to the elements of A growing (in absolute value) without 
bounds whereas S tends to zero. More importantly, the solution found does 
not depend on the second term of the objective as it can always be eliminated 
by this scaling trick. In other words, some constraint on the scales of A or S 
is needed. Olshausen and Field || used an adaptive method to ensure that 
the hidden components had unit variance (effectively fixing the norm of the 
rows of S), whereas Harpur |lj fixed the norms of the columns of A. 

With either of the above scale constraints the objective (|l|) is well-behaved 
and its minimization can produce useful decompositions of many types of 
data. For example, it was shown in Q that applying this method to image 
data yielded features closely resembling simple-cell receptive fields in the 
mammalian primary visual cortex. The learned decomposition is also similar 
to wavelet decompositions, implying that it could be useful in applications 
where wavelets have been successfully applied. 




(1) 



In standard sparse coding, described above, the data is described as a 
combination of elementary features involving both additive and subtractive 
interactions. The fact that features can 'cancel each other out' using subtrac- 
tion is contrary to the intuitive notion of combining parts to form a whole || . 
Thus, Lee and Seung || |^| have recently forcefully argued for non- negative 
representations ||. Other arguments for non- negative representations come 
from biological modeling || ||, || , where such constraints are related to the 
non-negativity of neural firing rates. These non-negative representations as- 
sume that the input data X, the basis A, and the hidden components S are 
all non-negative. 

Non-negative matrix factorization^] (NMF) can be performed by the min- 
imization of the following objective function [Q, ||: 

C(A,S)-i||X-AS|| 2 (2) 

with the non-negativity constraints Vij : Ay > 0, Sij > 0. This objective 
requires no constraints on the scales of A or S. 

In |J , the authors showed how non-negative matrix factorization applied 
to face images yielded features that corresponded to intuitive notions of face 
parts: lips, nose, eyes, etc. This was contrasted with the holistic representa- 
tions learned by PCA and vector quantization. 

We suggest that both the non-negativity constraints and the sparseness 
goal are important for learning parts-based representations. Thus, we pro- 
pose to combine these two methods into non-negative sparse coding: 

Definition 1 Non-negative sparse coding (NNSC) of a non-negative data 
matrix X (i.e. Vij : Xy > 0) is given by the minimization of 

C(A,S) = i||X-AS|| 2 + A^Sy (3) 

ij 

subject to the constraints Vij : Ay > 0, Sy > and Vi : ||aj|| = 1, where 
denotes the i:th column of A. It is also assumed that the constant A > 0. 

Notice that we have here chosen to measure sparseness by a linear activation 
penalty (i.e. f(s) = s). This particular choice is primarily motivated by the 
fact that this makes the objective function quadratic in S. This is useful in the 
development and convergence proof of an efficient algorithm for optimizing 
the hidden components S. 



ESTIMATING THE HIDDEN COMPONENTS 

We will first consider optimizing S, for a given basis A. As the objective (||) is 
quadratic with respect to S, and the set of allowed S (i.e. the set where Sy > 

j-Nflte that error measures other than the summed squared error were also considered 



0) is convex, we are guaranteed that no suboptimal local minima exist. The 
global minimum can be found using, for example, quadratic programming 
or gradient descent. Gradient descent is quite simple to implement, but 
convergence can be slow. On the other hand, quadratic programming is 
much more complicated to implement. To address these concerns, we have 
developed a multiplicative algorithm based on the one introduced in that 
is extremely simple to implement and nonetheless seems to be quite efficient. 
This is given by iterating the following update rule: 

Theorem 1 The objective (Q) is nonincreasing under the update rule: 

S t+1 = S* .* (A T X) ./ (A T AS* + A) (4) 

where .* and ./ denote elementwise multiplication and division (respectively), 
and the addition of the scalar A is done to every element of the matrix 
A T AS t . 

This is proven in the Appendix. As each clement of S is updated by simply 
multiplying with some non-negative factor, it is guaranteed that the elements 
of S stay non-negative under this update rule. As long as the initial values 
of S are all chosen strictly positive, iteration of this update rule is in practice 
guaranteed to reach the global minimum to any required precision. 



LEARNING THE BASIS 

In this section we consider optimizing the objective (|3|) with respect to both 
the basis A and the hidden components S, under the stated constraints. 
First, we consider the optimization of A only, holding S fixed. 

Minimizing (^) with respect to A under the non- negativity constraint only 
could be done exactly as in 0, with a simple multiplicative update rule. 
However, the constraint of unit-norm columns of A complicates things. We 
have not found any similarly efficient update rule that would be guaranteed 
to decrease the objective while obeying the required constraint. Thus, we 
here resort to projected gradient descent. Each step is composed of three 
parts: 

1. A' = A* — fi(A t S — X)S T 

2. Any negative values in A' are set to zero 

3. Rescale each column of A' to unit norm, and then set A* +1 = A'. 

This combined step consists of a gradient descent step (Step 1) followed 
by projection onto the closest point satisfying both the non-negativity and 
the unit-norm constraints (Steps 2 and 3). This projected gradient step is 
guaranteed to decrease the objective if the stepsize \i > is small enough and 
we are not already at a local minimum. (In this case there is no guarantee 
of reaching the global minimum, due to the non-convex constraints.) 



In the previous section, we gave an update step for S, holding A fixed. 
Above, we showed how to update A, holding S fixed. To optimize the ob- 
jective with respect to both, we can of course take turns updating A and S. 
This yields the following algorithm: 



Algorithm for NNSC 

1. Initialize A and S° to random strictly positive matrices of the 
appropriate dimensions, and rescale each column of A to unit 
norm. Set t = 0. 

2. Iterate until convergence: 



EXPERIMENTS 

To demonstrate how sparseness can be essential for learning a parts-based 
non-negative representation, we performed a simple simulation where the 
generating features were known. The interested reader can find the code to 
perform these experiments (as well as the experiments reported in Q|) on the 
web at: 



In our simulations, the data vectors were 3x3 -pixel images with non- 
negative pixel values. We manually constructed 10 original features: the six 
possible horizontal and vertical bars, and the four possible horizontal and 
vertical double bars. Each feature was normalized to unit norm, and entered 
as a column in the matrix A or j g . The features are shown in the leftmost panel 
of Figure |l|. We then generated random sparse non-negative data S or j g , and 
obtained the data vectors as X = A £jgS or ;g. A random sample of 12 such 
data vectors are also shown in Figure ET 

We ran NNSC and NMF on this data X. With 10 hidden components 
(rows of S), NNSC can correctly identify all the features in the dataset. This 
result is shown in Figure [l] under NNSC. However, NMF cannot find all the 
features with any hidden dimensionality. With 6 components, NMF finds 
all the single bar features. With a dimensionality of 10, not even all of the 
single bars are correctly estimated. These results are illustrated in the two 
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Figure 1: Experiments on bars data. Features: The 10 original features that were 
used to construct the dataset. Data: A random sample of 12 data vectors. These 
constitute superpositions of the original features. NNSC: Features learned by 
NNSC, with dimensionality of the hidden representation equal to 10, starting from 
random initial values. NMF (6): Features learned by NMF, with dimensionality 6. 
NMF (10): Features learned by NMF, with dimensionality 10. See main text for 
discussion. 

rightmost panels of Figure |l|. 

It is not difficult to understand why NMF cannot learn all the features. 
The data X can be perfectly described as an additive combination of the 
six single bars (because all double bars can be described as two single bars). 
Thus, NMF essentially achieves the optimum (zero reconstruction error) al- 
ready with 6 features, and there is no way in which an overcomplete repre- 
sentation could improve that. However, when sparseness is considered as in 
NNSC, it is clear that it is useful to have double bar features because these 
allow a sparser description of such data patterns. 

In addition to these simulations, we have performed experiments with 
natural image data, reported elsewhere || ||. These confirm our belief 
that sparseness is important when learning non-negative representations from 
data. 



RELATION TO OTHER WORK 

In addition to the tight connection to linear sparse coding || || and non- 
negative matrix factorization |^|, 7, this method is intimately related to 
independent component analysis [5]. In fact, when the fixed-norm constraint 
is placed on the rows of S instead of the columns of A, the objective (^) 
could be directly interpreted as the negative joint log-posterior of the basis 
vectors and components, given the data X, in the noisy ICA model 0. This 
connection is valid when the independent components are assumed to have 
exponential distributions, and of course the basis vectors are assumed to be 
non-negative as well. 

Other researchers have also recently considered the constraint of non- 
negativity in the context of ICA. In particular, Plumbley Jll| has considered 
estimation of the noiseless ICA model (with equal dimensionality of com- 
ponents and observations) in the case of non-negative components. On the 



other hand, Parra et al. jl(J considered estimation of the ICA model where 
the basis (but not the components) was constrained to be non-negative. The 
main novelty of the present work is the application of the non-negativity con- 
straints in the sparse coding framework, and the simple yet efficient algorithm 
developed to estimate the components. 



CONCLUSIONS 

In this paper, we have defined non-negative sparse coding as a combination 
of sparse coding with the constraints of non-negative matrix factorization. 
Although this is essentially a special case of the general sparse coding frame- 
work, we believe that the proposed constraints can be important for learning 
parts-based representations from non-negative data. In addition, the con- 
straints allow a very simple yet efficient algorithm for estimating the hidden 
components. 



APPENDIX 

To prove Theorem [j], first note that the objective (^) is separable in the 
columns of S so that each column can be optimized without considering the 
others. We may thus consider the problem for the case of a single column, 
denoted s. The corresponding column of X is denoted x, giving the objective 

F(s) = i||x-As|| 2 + A]>>. (5) 

i 

The proof will follow closely the proof given in [[?] for the case A = 0. 
(Note that in [Q, the notation v = x, W = A and h = s was used.) We 
define an auxiliary function G(s,s 4 ) with the properties that G(s,s) = F(s) 
and G(s,s*) > F(s). We will then show that the multiplicative update rule 
corresponds to setting, at each iteration, the new state vector to the values 
that minimize the auxiliary function: 

s* +1 = argminG(s,s t ). (6) 

s 

This is guaranteed not to increase the objective function F, as 

F(s t+1 ) < G(s t+1 , s*) < G(s\ s 4 ) = F(s'). (7) 
Following |Q, we define the function G as 

G(s, s*) = F(s*) + (s - s t ) T V J F(s t ) + hs - s t ) T K(s t )(s - s 4 ) (8) 
where the diagonal matrix K(s*) is defined as 

ts t t\ x (A T As*) Q + A 

K ab (s t ) = S ab / . (9) 



It is important to note that the elements of our choice for K are always 
greather than or equal to those of the K used in , which is the case where 
A = 0. It is obvious that G(s,s) = F(s). Writing out 

F(s) = F(s*) + (s - s*) T VF(s t ) + -(s - s*) T (A T A)(s - s*), (10) 

we see that the second property, G(s,s') > F(s), is satisfied if 

0< (s-s*) T [K(s t ) - A T A](s-s t ). (11) 

Lee and Seung proved this positive semidefiniteness for the case of A = 
Q. In our case, with A > 0, the matrix whose positive semidefiniteness is 
to be proved is the same except that a strictly non-negative diagonal matrix 
has been added (see the above comment on the choice of K). As a non- 
negative diagonal matrix is positive semidefinite, and the sum of two positive 
semidefinite matrices is also positive semidefinite, the A = proof in Q| also 
holds when A > 0. 

It remains to be shown that the update rule in (|]) selects the minimum 
of G. This minimum is easily found by taking the gradient and equating it 
to zero: 

V s G(s, s 4 ) = A T (As* - x) + Ac + K(s*)(s - s*) = 0, (12) 

where c is a vector with all ones. Solving for s, this gives 

s = s* -K- 1 (s*)(A T As t - A T x + Ac) (13) 
= s*-(s t ./(A T As t + Ac)).*(A T As t -A T x + Ac) (14) 
= s*.*(A T x)./(A T As t + Ac)) (15) 

which is the desired update rule (0). This completes the proof. 
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